Practical Synthetic Data Generation

Synthetic Data

Author

김보람

Published

March 10, 2023

Practical Synthetic Data Generation

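Subtitle: Practical Datasets for Machine Learning
Publisher: Hanbit Media (한빛미디어)
Authors: Khaled El Emam, Lucy Mosquera, Richard Hoptroff
Translator: 심상진
Description: Building and testing machine learning models requires large and varied data. However, most datasets are restricted by privacy concerns and cannot be used broadly. This book introduces practical synthetic data techniques that create new data from real data. Synthetic data lends itself to secondary analysis, so it can be used for many purposes such as data research, understanding customer behavior, and new product development. The book offers ways to synthesize real data for use across industries and covers how to address privacy problems. It also teaches the principles and steps for generating synthetic data from real datasets, and goes on to show how synthetic data can shorten the time needed to develop a product or solution.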

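CHAPTER 1 Introducing Synthetic Data Generation

- Definition of synthetic data

Data that is not real data but is generated from real data and has the same statistical properties

- Types of synthetic data

Synthesis from real data
Synthesis without real data
Hybrid that combines the two types

- Benefits of synthetic data

Not identifiable personal data \(\to\) privacy regulations do not apply
Data can be used for secondary purposes
Usable even when collecting real data is difficult, impractical, or unethical
Train initial models \(\to\) speeds up validation of the data model

- Use cases for synthetic data

Manufacturing/distribution
Healthcare
Financial services
Transportation

- Public data is not controlled.

- Public Health England's synthetic cancer registry data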

CHAPTER 2 Data Synthesis

- Is data synthesis the best way to get access to data?

\(\to\) Consider the implementation process

- Decision criteria

Privacy, operational cost, data utility, consumer trust

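CHAPTER 3 Getting Started: Distribution Fitting

- Data synthesis methodology and techniques

Fit individual variables to a distribution model (classical distributions such as the normal or exponential), or model the structure of the data

- Remedies for overfitting

Start the distribution at a neutral point and move it step by step closer to the data, balancing the simplicity of the distribution against goodness of fit at each step
Prevent overfitting by measuring when the best trade-off has been reached
Distribution-fitting approach?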
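
One way to picture the simplicity-versus-fit trade-off in the notes above is to fit two classical distributions to a single variable and compare them with a penalized score. The sketch below uses the AIC as that score, which is my own choice for illustration and not necessarily the criterion the book uses; the `waits` data is made up.

```python
import math
import random

def normal_loglik(xs):
    """Maximized log-likelihood of a normal distribution (2 parameters)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # maximum-likelihood variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def exponential_loglik(xs):
    """Maximized log-likelihood of an exponential distribution (1 parameter)."""
    n = len(xs)
    rate = n / sum(xs)  # maximum-likelihood rate = 1 / mean
    return n * math.log(rate) - rate * sum(xs)

def aic(loglik, n_params):
    """Akaike information criterion: lower is better, extra parameters are penalized."""
    return 2 * n_params - 2 * loglik

# Hypothetical variable: 500 waiting times drawn from an exponential distribution.
rng = random.Random(0)
waits = [rng.expovariate(0.5) for _ in range(500)]

scores = {
    "normal": aic(normal_loglik(waits), n_params=2),
    "exponential": aic(exponential_loglik(waits), n_params=1),
}
best = min(scores, key=scores.get)
print(scores, best)
```

The candidate with the lowest score is the one that explains the data well without spending extra parameters, which is the balance the notes describe.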

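CHAPTER 4 Evaluating the Utility of Synthetic Data

- Univariate

Compute the Hellinger distance to measure, for each variable, the difference between its distribution in the real data and in the synthetic data
0: no difference between the distributions ~ 1: large difference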
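
As a rough illustration of the 0-to-1 scale, here is a minimal Hellinger distance for one categorical variable. It assumes the variable is discrete (a continuous variable would need binning first), and the `real` and `synthetic` columns are made-up examples.

```python
import math
from collections import Counter

def hellinger(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two columns."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    total = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real    # share of this category in the real data
        q = synth_counts[category] / n_synth  # share of this category in the synthetic data
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)

# Hypothetical gender column in the real and synthetic tables.
real = ["M"] * 60 + ["F"] * 40
synthetic = ["M"] * 50 + ["F"] * 50
print(hellinger(real, synthetic))
```

Identical distributions give 0, and distributions with no category in common give 1.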

- Bivariate

The absolute difference between the real and synthetic data in the correlation of every pair of variables = a measure of data utility
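
A minimal sketch of that bivariate check, assuming all columns are numeric and using the Pearson coefficient (other variable types would need a different correlation measure). The column names and values are invented for the example.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_differences(real, synthetic):
    """Absolute real-vs-synthetic correlation difference for every column pair.

    `real` and `synthetic` map column names to lists of numbers.
    """
    diffs = {}
    for a, b in combinations(sorted(real), 2):
        r_real = pearson(real[a], real[b])
        r_synth = pearson(synthetic[a], synthetic[b])
        diffs[(a, b)] = abs(r_real - r_synth)
    return diffs

# Hypothetical tables: the synthetic data flips the sign of the x-y relationship.
real = {"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10], "z": [5, 3, 4, 1, 2]}
synthetic = {"x": [1, 2, 3, 4, 5], "y": [10, 8, 6, 4, 2], "z": [5, 3, 4, 1, 2]}
diffs = correlation_differences(real, synthetic)
print(diffs)
```

Small differences across all pairs suggest the synthetic data preserved the pairwise relationships; a large value points to the pair that was distorted.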

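- Multivariate

10-fold cross-validation
Split the dataset into 10 subsets of equal size
Hold out subset 1 as the test set and build a model on the remaining 9 subsets
Test the model on subset 1 and compute the AUROC
Next, hold out subset 2 as the test set, train on the rest, and compute the AUROC
… repeat 10 times \(\to\) 10 AUROC values \(\to\) compute the mean

- Run this computation for corresponding models on the synthetic data and on the real data

- Compute the absolute difference between the two AUROC values

- Draw a box plot of all the absolute differences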
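
The procedure above can be sketched in plain Python. Everything here is an assumption made for illustration: the nearest-centroid "model" stands in for whatever classifier is actually used, the two toy datasets are randomly generated, and only one real/synthetic model pair is compared (the box plot would collect many such differences).

```python
import random

def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroids(rows, labels):
    """Toy classifier: store the mean feature vector of each class."""
    centroids = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def score(centroids, row):
    """Higher score = closer to the class-1 centroid than to the class-0 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(row, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(row, centroids[1]))
    return d0 - d1

def mean_cv_auroc(rows, labels, k=10, seed=0):
    """Average AUROC over k folds, each fold held out once as the test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k subsets of (nearly) equal size
    aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_centroids([rows[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        test_scores = [score(model, rows[i]) for i in test_idx]
        aucs.append(auroc([labels[i] for i in test_idx], test_scores))
    return sum(aucs) / k

def toy_dataset(n, shift, seed):
    """Hypothetical binary-outcome data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        rows.append([rng.gauss(shift * y, 1.0), rng.gauss(0.0, 1.0)])
        labels.append(y)
    return rows, labels

real_rows, real_labels = toy_dataset(400, shift=1.5, seed=1)
synth_rows, synth_labels = toy_dataset(400, shift=1.2, seed=2)

auc_real = mean_cv_auroc(real_rows, real_labels)
auc_synth = mean_cv_auroc(synth_rows, synth_labels)
auc_gap = abs(auc_real - auc_synth)
print(auc_real, auc_synth, auc_gap)
```

A small `auc_gap` means a model trained on the synthetic data predicts about as well as one trained on the real data.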

- Propensity score.. (difficult… p. 103)

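CHAPTER 5 Methods for Synthesizing Data

- Methods

Sampling from a normal distribution
Inducing correlations during the sampling process
Using copulas

- Copula?

A function that describes the correlation or dependence between random variables
Hmm.. look this up later..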
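
For a concrete picture of the three steps listed above, here is a small Gaussian copula sketch: draw correlated normal values, convert them to uniform values with the normal CDF, then push each through the inverse CDF of the marginal you want. The marginals (an exponential and a uniform distribution), their parameters, and the function names are all hypothetical choices for this example.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_sample(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n (x, y) pairs whose dependence comes from a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Step 1: sample two standard normal values with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: map them to (0, 1); clamp to avoid log(0) in the inverse CDFs.
        u1 = min(max(std_normal.cdf(z1), 1e-12), 1 - 1e-12)
        u2 = min(max(std_normal.cdf(z2), 1e-12), 1 - 1e-12)
        # Step 3: apply each marginal's inverse CDF.
        samples.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return samples

def exponential_inv_cdf(u, rate=0.5):
    """Inverse CDF of an exponential distribution (hypothetical marginal)."""
    return -math.log(1 - u) / rate

def uniform_inv_cdf(u, low=20.0, high=60.0):
    """Inverse CDF of a uniform distribution on [low, high] (hypothetical marginal)."""
    return low + u * (high - low)

pairs = gaussian_copula_sample(2000, rho=0.8,
                               inv_cdf_x=exponential_inv_cdf,
                               inv_cdf_y=uniform_inv_cdf)
print(pairs[:3])
```

The copula carries only the dependence structure, so the marginals can be swapped freely without touching the correlation step.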

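- Machine learning methods: decision trees (CART)

- Deep learning methods: variational autoencoders (VAE), generative adversarial networks (GAN)

- GAN
  - Generator: takes random input data, sampled from a normal or uniform distribution, and generates synthetic data
  - Discriminator: compares the synthetic data with the real data and produces a propensity score for how similar they are; the resulting difference is fed back to train the generator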

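CHAPTER 6 Identity Disclosure in Synthetic Data

Types of disclosure
Identity disclosure (if the information gain is zero, the identity disclosure is meaningless)
Attribute disclosure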

Synthetic Data References

(Journal) The Rise of Synthetic Data
(Journal) The Age of Synthetic Data Is Coming

SDV

The Synthetic Data Vault

SDMetrics

!pip install sdmetrics

Successfully installed copulas-0.8.0 scikit-learn-1.0.2 sdmetrics-0.9.2 threadpoolctl-3.1.0

data

from sdmetrics import load_demo 

real_data, synthetic_data, metadata = load_demo(modality='single_table')

metadata

{'fields': {'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
  'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
  'salary': {'type': 'numerical', 'subtype': 'integer'},
  'duration': {'type': 'numerical', 'subtype': 'integer'},
  'student_id': {'type': 'id', 'subtype': 'integer'},
  'high_perc': {'type': 'numerical', 'subtype': 'float'},
  'high_spec': {'type': 'categorical'},
  'mba_spec': {'type': 'categorical'},
  'second_perc': {'type': 'numerical', 'subtype': 'float'},
  'gender': {'type': 'categorical'},
  'degree_perc': {'type': 'numerical', 'subtype': 'float'},
  'placed': {'type': 'boolean'},
  'experience_years': {'type': 'numerical', 'subtype': 'float'},
  'employability_perc': {'type': 'numerical', 'subtype': 'float'},
  'mba_perc': {'type': 'numerical', 'subtype': 'float'},
  'work_experience': {'type': 'boolean'},
  'degree_type': {'type': 'categorical'}},
 'constraints': [],
 'model_kwargs': {},
 'name': None,
 'primary_key': 'student_id',
 'sequence_index': None,
 'entity_columns': [],
 'context_columns': []}

real_data.head()

	student_id	gender	second_perc	high_perc	high_spec	degree_perc	degree_type	work_experience	experience_years	employability_perc	mba_spec	mba_perc	salary	placed	start_date	end_date	duration
0	17264	M	67.00	91.00	Commerce	58.00	Sci&Tech	False	0	55.0	Mkt&HR	58.80	27000.0	True	2020-07-23	2020-10-12	3.0
1	17265	M	79.33	78.33	Science	77.48	Sci&Tech	True	1	86.5	Mkt&Fin	66.28	20000.0	True	2020-01-11	2020-04-09	3.0
2	17266	M	65.00	68.00	Arts	64.00	Comm&Mgmt	False	0	75.0	Mkt&Fin	57.80	25000.0	True	2020-01-26	2020-07-13	6.0
3	17267	M	56.00	52.00	Science	52.00	Sci&Tech	False	0	66.0	Mkt&HR	59.43	NaN	False	NaT	NaT	NaN
4	17268	M	85.80	73.60	Commerce	73.30	Comm&Mgmt	False	0	96.8	Mkt&Fin	55.50	42500.0	True	2020-07-04	2020-09-27	3.0

synthetic_data.head()

	student_id	gender	second_perc	high_perc	high_spec	degree_perc	degree_type	work_experience	employability_perc	mba_spec	mba_perc	salary	placed	start_date	end_date	duration
0	0	F	41.361060	85.425072	Commerce	74.972674	Comm&Mgmt	False	49.986653	Mkt&Fin	57.291083	NaN	True	2020-02-11	2020-08-02	3.0
1	1	M	63.720169	99.059033	Commerce	62.769650	Others	False	78.962948	Mkt&HR	79.068319	NaN	False	NaT	NaT	NaN
2	2	M	58.473884	89.241528	Science	83.066328	Sci&Tech	True	47.980244	Mkt&Fin	77.042950	26727.0	True	2020-02-13	2020-05-27	3.0
3	3	F	77.232204	100.523788	Commerce	61.010445	Comm&Mgmt	True	61.016218	Mkt&HR	68.132991	22058.0	True	2020-09-24	2020-11-07	3.0
4	4	F	54.067830	109.611537	Commerce	72.846753	Others	True	66.949987	Mkt&Fin	66.363138	NaN	False	NaT	NaT	NaN

from sdmetrics.reports.single_table import QualityReport

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)

Creating report: 100%|██████████| 4/4 [00:00<00:00, 12.97it/s]


Overall Quality Score: 81.44%

Properties:
Column Shapes: 81.56%
Column Pair Trends: 81.33%
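
The overall score in this output is consistent with a plain average of the two property scores. A quick check, using the rounded percentages printed above (so the result can differ from the report's figure in the last digit):

```python
# Property scores copied from the report output above (already rounded).
property_scores = {"Column Shapes": 81.56, "Column Pair Trends": 81.33}

overall = sum(property_scores.values()) / len(property_scores)
print(f"{overall:.2f}%")
```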

report.get_details(property_name='Column Shapes')

	Column	Metric	Quality Score
0	second_perc	KSComplement	0.627907
1	high_perc	KSComplement	0.553488
2	degree_perc	KSComplement	0.627907
3	experience_years	KSComplement	0.800000
4	employability_perc	KSComplement	0.781395
5	mba_perc	KSComplement	0.841860
6	salary	KSComplement	0.869155
7	start_date	KSComplement	0.701107
8	end_date	KSComplement	0.768919
9	duration	KSComplement	0.826051
10	gender	TVComplement	0.939535
11	high_spec	TVComplement	0.902326
12	degree_type	TVComplement	0.925581
13	work_experience	TVComplement	0.972093
14	mba_spec	TVComplement	0.995349
15	placed	TVComplement	0.916279
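
The two metric names in this table can be read as "one minus a distance". Below is a simplified sketch of both, assuming no missing values: `tv_complement` is one minus the total variation distance between category frequencies, and `ks_complement` is one minus the two-sample Kolmogorov-Smirnov statistic. These are my own re-implementations for illustration, not the SDMetrics source. The last lines check that the 81.56% Column Shapes score matches the mean of the per-column scores listed above.

```python
from bisect import bisect_right
from collections import Counter

def tv_complement(real_values, synthetic_values):
    """1 - total variation distance between two categorical columns."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth)
        for c in set(real_freq) | set(synth_freq)
    )
    return 1 - tvd

def ks_complement(real_values, synthetic_values):
    """1 - largest gap between the empirical CDFs of two numeric columns."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1 - max_gap

# Per-column quality scores copied from the table above.
column_scores = [
    0.627907, 0.553488, 0.627907, 0.800000, 0.781395, 0.841860,
    0.869155, 0.701107, 0.768919, 0.826051, 0.939535, 0.902326,
    0.925581, 0.972093, 0.995349, 0.916279,
]
column_shapes_score = sum(column_scores) / len(column_scores)
print(round(column_shapes_score, 4))
```

Both functions return 1.0 when the real and synthetic columns have the same distribution and 0.0 when they do not overlap at all, which matches the direction of the Quality Score column.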

Visualization

report.get_visualization(property_name='Column Shapes')

- High quality: mba_spec (TVComplement 0.995349)

get_column_plot?

Signature: get_column_plot(real_data, synthetic_data, column_name, metadata)
Docstring:
Return a plot of the real and synthetic data for a given column.
Args:
    real_data (pandas.DataFrame):
        The real table data.
    synthetic_data (pandas.DataFrame):
        The synthetic table data.
    column_name (str):
        The name of the column.
    metadata (dict):
        The table metadata.
Returns:
    plotly.graph_objects._figure.Figure
File:      ~/anaconda3/envs/py37/lib/python3.7/site-packages/sdmetrics/reports/utils.py
Type:      function

from sdmetrics.reports.utils import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    column_name='mba_spec'
)

fig.show()

- Low quality: second_perc (KSComplement 0.627907)

from sdmetrics.reports.utils import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    column_name='second_perc'
)

fig.show()

report.get_visualization(property_name='Column Pair Trends')

from sdmetrics.reports.utils import get_column_pair_plot

get_column_pair_plot?

Signature: get_column_pair_plot(real_data, synthetic_data, column_names, metadata)
Docstring:
Return a plot of the real and synthetic data for a given column pair.
Args:
    real_data (pandas.DataFrame):
        The real table data.
    synthetic_column (pandas.Dataframe):
        The synthetic table data.
    column_names (list[string]):
        The names of the two columns to plot.
    metadata (dict):
        The table metadata.
Returns:
    plotly.graph_objects._figure.Figure
File:      ~/anaconda3/envs/py37/lib/python3.7/site-packages/sdmetrics/reports/utils.py
Type:      function

fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    column_names=['start_date', 'second_perc']
)

fig.show()

Saving the report to a file

report.save(filepath='sdmetrics_quality_demo.pkl')

# load the report at a later time
report = QualityReport.load(filepath='sdmetrics_quality_demo.pkl')